ARROW-17303: [Java][Dataset] Read Arrow IPC files by NativeDatasetFactory (#13760) #13811

Merged: lidavidm merged 5 commits into apache:master from igor-suhorukov:master on Aug 8, 2022

@igor-suhorukov

@igor-suhorukov igor-suhorukov commented Aug 7, 2022

Contributor

This PR allows developers to create a Dataset from Arrow IPC files in JVM code, like:
FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(), FileFormat.ARROW_IPC, arrowDatasetURL);

It is the foundation for an Apache Spark Arrow data source that can process huge existing partitioned datasets in the Arrow file format without additional data format conversion.

@github-actions

github-actions Bot commented Aug 7, 2022


⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@lidavidm

lidavidm commented Aug 8, 2022

Member

Thanks for the PR!

@davisusanibar @lwhite1 would one of you mind taking a look?

Is "osm_nodes.arrow" from OpenStreetMap? Are there licensing concerns around the data? Arrow already has test data files for use and/or files can be generated in-process.

@igor-suhorukov

igor-suhorukov commented Aug 8, 2022

Contributor Author

@lidavidm yes, it is 10 records from the OpenStreetMap planet dump. Could you please provide more information on how to generate test data in the Arrow file format to test the Dataset API, or on where the existing test data is located?

@lidavidm

lidavidm commented Aug 8, 2022

Member

It'd be something like

// Assumes TMP is a JUnit TemporaryFolder rule and allocator is an existing BufferAllocator
File out = TMP.newFile();
Schema schema = new Schema(Collections.singletonList(Field.nullable("ints", new ArrowType.Int(32, true))));
try (VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator);
     FileOutputStream fileOutputStream = new FileOutputStream(out);
     ArrowFileWriter writer = new ArrowFileWriter(root, /*dictionaryProvider=*/null, fileOutputStream.getChannel())) {
    // Fill root with data
    IntVector ints = (IntVector) root.getVector(0);
    ints.setSafe(0, 0);
    root.setRowCount(1);
    // ...
    // Write the schema, one record batch, and the file footer
    writer.start();
    writer.writeBatch();
    writer.end();
}
// Use out.getPath()...

@igor-suhorukov

Contributor Author

@lidavidm thank you for the advice. The OSM data was deleted from the PR. Please check the updated test TestFileSystemDataset#testBaseArrowIpcRead.
Does it fit the project's testing approach?

@lwhite1

lwhite1 commented Aug 8, 2022

Contributor

Hi @igor-suhorukov, this looks good to me, except I wish the tests were more robust. (The same is true for the Parquet test that you're emulating, but I guess that's out of scope here.)

This kind of test - relying on checking sizes and names - doesn't provide much assurance that we won't see bug reports when people import complex data types or otherwise tap into some of the more advanced functionality.

@lidavidm

lidavidm commented Aug 8, 2022

Member

@lwhite1 we could file another JIRA for that?

@lidavidm

lidavidm commented Aug 8, 2022

Member

Also a general note re: Larry's comment: we currently have a mix of JUnit 4/5, ad-hoc test helpers like the one here, and a mix of assertion libraries; it might be good to start incrementally cleaning that up (e.g. it would be much easier to test complex types if there were an easy setup to parameterize a test and have the data generated for you).

ARROW-6931 is sort of related, and ARROW-4740 (we added JUnit5 but didn't port the existing tests)

@lwhite1

lwhite1 commented Aug 8, 2022 via email

Contributor

@lidavidm

lidavidm commented Aug 8, 2022

Member

I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface)

@lwhite1

lwhite1 commented Aug 8, 2022

Contributor

> I suggest a separate ticket because 1) generating test data is very unergonomic (as seen here) and could use some thought across different areas of the codebase and 2) I'd rather push down the testing to the appropriate levels (IPC, Parquet, and eventually CSV should share most of their testing code, the same way the C++ library is organized; and most of the type-specific tests should be done for the C Data Interface)

Ok. Works for me.

@lwhite1 lwhite1 left a comment

Contributor

LGTM

@lidavidm

lidavidm commented Aug 8, 2022

Member

I filed https://issues.apache.org/jira/browse/ARROW-17342

@lidavidm

lidavidm commented Aug 8, 2022

Member

FWIW, looking at the JIRA/GH issue, this will only handle "IPC" files, not Arrow stream files - there's work needed on the C++ side if that is something we want to cover
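
To make the file-versus-stream distinction concrete, here is a minimal sketch of a pre-check that a caller might run before handing a path to the dataset factory. It relies only on the Arrow IPC file layout, in which the file begins and ends with the ASCII magic string "ARROW1"; stream files carry no such magic. The class name ArrowFileSniffer and the method looksLikeIpcFile are hypothetical and not part of Arrow, and the check is a heuristic, not a validation of the footer.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not part of Arrow): guesses whether a file uses the
// Arrow IPC *file* format by looking for the "ARROW1" magic at both ends.
final class ArrowFileSniffer {
    private static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    private ArrowFileSniffer() {}

    /** Returns true if the file both starts and ends with the "ARROW1" magic. */
    static boolean looksLikeIpcFile(Path path) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r")) {
            long length = raf.length();
            // Too short to hold a leading and a trailing magic string.
            if (length < 2L * MAGIC.length) {
                return false;
            }
            byte[] head = new byte[MAGIC.length];
            byte[] tail = new byte[MAGIC.length];
            raf.readFully(head);
            raf.seek(length - MAGIC.length);
            raf.readFully(tail);
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args[0]);
        System.out.println(path + " looks like an Arrow IPC file: " + looksLikeIpcFile(path));
    }
}
```

A stream file should fail this check, so a caller could fall back to a stream reader or report a clearer error instead of passing the path to FileFormat.ARROW_IPC.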

@lidavidm lidavidm merged commit 78351ce into apache:master Aug 8, 2022
@igor-suhorukov

Contributor Author

Thanks a lot for the clarification, @lidavidm and @lwhite1, and for your time. Don't worry about the refactoring. I have experience with this kind of work (refactoring, fixing tech debt, and cleanup) on Spring/ElasticSearch projects. It can be a crowd contribution once the Arrow project is more mature, split into separate activities for new joiners. A good start for someone.

@ursabot

ursabot commented Aug 9, 2022


Benchmark runs are scheduled for baseline = a2f3666 and contender = 78351ce. 78351ce is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.34% ⬆️0.0%] test-mac-arm
[Finished ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.14% ⬆️0.04%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 78351cec ec2-t3-xlarge-us-east-2
[Finished] 78351cec test-mac-arm
[Finished] 78351cec ursa-i9-9960x
[Finished] 78351cec ursa-thinkcentre-m75q
[Finished] a2f3666d ec2-t3-xlarge-us-east-2
[Finished] a2f3666d test-mac-arm
[Finished] a2f3666d ursa-i9-9960x
[Finished] a2f3666d ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Yicong-Huang added a commit to apache/texera that referenced this pull request Dec 13, 2022
This PR bumps Apache Arrow version from 9.0.0 to 10.0.0.

Main changes related to PyAmber:

## Java/Scala side:

- JDBC Driver for Arrow Flight SQL
([13800](apache/arrow#13800))
- Initial implementation of immutable Table API
([14316](apache/arrow#14316))
- Substrait, transaction, cancellation for Flight SQL
([13492](apache/arrow#13492))
- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory
([13811](apache/arrow#13811),
[13973](apache/arrow#13973),
[14182](apache/arrow#14182))
- Add utility to bind Arrow data to JDBC parameters
([13589](apache/arrow#13589))

## Python side:

- The batch_readahead and fragment_readahead arguments for scanning
Datasets are exposed in Python
([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
- ExtensionArrays can now be created from a storage array through the
pa.array(..) constructor
([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
- Converting ListArrays containing ExtensionArray values to numpy or
pandas works by falling back to the storage array
([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
- Casting Tables to a new schema now honors the nullability flag in the
target schema
([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
pribor pushed a commit to GlobalWebIndex/arrow that referenced this pull request Oct 24, 2025
…tory (apache#13760) (apache#13811)

This PR allows developers to create a Dataset from Arrow IPC files in JVM code, like:
`FileSystemDatasetFactory factory = new FileSystemDatasetFactory(rootAllocator(), NativeMemoryPool.getDefault(),
            FileFormat.ARROW_IPC, arrowDatasetURL);`

It is the foundation for an Apache Spark Arrow data source that can process huge existing partitioned datasets in the Arrow file format without additional data format conversion.

Lead-authored-by: Igor Suhorukov <igor.suhorukov@gmail.com>
Co-authored-by: igor.suhorukov <igor.suhorukov@gmail.com>
Signed-off-by: David Li <li.davidm96@gmail.com>
yangzhang75 pushed a commit to yangzhang75/texera that referenced this pull request Jun 22, 2026

Development

Successfully merging this pull request may close these issues.

Read "arrow" (IPC and streaming) files using org.apache.arrow.dataset.jni.NativeDatasetFactory in Java API

4 participants